IEOR 8100-001: Learning and Optimization for Sequential Decision Making 02/03/16 Lecture 5: Thompson Sampling (part II): Regret bounds proofs
Abstract
We describe the main technical difficulties in the proof of the regret bound for the Thompson Sampling (TS) algorithm as compared to the UCB algorithm. In the UCB algorithm, the suboptimal arm 2 is played at time t only if its UCB value is higher, i.e. if UCB_{2,t−1} > UCB_{1,t−1}. Once arm 2 has been pulled Ω(log(T)/Δ²) times, with high probability this no longer happens: after n_{2,t} ≥ Ω(log(T)/Δ²), concentration bounds imply that the empirical mean of arm 2 is close to its true mean μ₂, so that with high probability UCB_{2,t−1} < UCB_{1,t−1} and arm 2 is not played.
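As a sanity check on the concentration argument above, here is a minimal sketch of the UCB1 index on a two-armed Bernoulli bandit. The arm means (0.9 and 0.4) and the confidence radius sqrt(2 log t / n_i) are standard illustrative choices, not taken from the lecture; the point is only that the suboptimal arm's pull count stays on the order of log(T)/Δ², far below T.

```python
import math
import random

def ucb_two_arms(T, mu1=0.9, mu2=0.4, seed=0):
    """Run UCB1 on a two-armed Bernoulli bandit with (hypothetical) means
    mu1 > mu2 for T rounds; return the number of pulls of suboptimal arm 2."""
    rng = random.Random(seed)
    mus = [mu1, mu2]
    counts = [0, 0]    # n_{i,t}: number of pulls of each arm so far
    sums = [0.0, 0.0]  # running sums of observed rewards
    # Initialize by pulling each arm once.
    for i in range(2):
        counts[i] += 1
        sums[i] += 1.0 if rng.random() < mus[i] else 0.0
    for t in range(2, T):
        # UCB index: empirical mean + confidence radius sqrt(2 log t / n_i).
        ucbs = [sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
                for i in range(2)]
        i = 0 if ucbs[0] >= ucbs[1] else 1  # play the arm with the higher index
        counts[i] += 1
        sums[i] += 1.0 if rng.random() < mus[i] else 0.0
    return counts[1]

if __name__ == "__main__":
    # With gap Δ = 0.5, arm 2 should be pulled only O(log(T)/Δ²) times.
    print(ucb_two_arms(5000))
```

Running this for T = 5000 yields a pull count for arm 2 in the hundreds at most, consistent with the Ω(log(T)/Δ²) threshold after which the confidence interval of arm 2 concentrates below UCB_{1,t−1}.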